Dear Dr. Suddarth,

The following is the Final Technical Report for the grant “Using Modular Neural
Networks With Local Representations To Control Dynamic Systems” (AFOSR-89-0500)
for the period 9/1/89 - 8/31/92. I also enclose several papers that cover the work done
in that period.

The objective of this research was to develop an artificial neural network with very fast
learning. Major areas of activity included developing cross validation methods to refine
parameters such as the distance metric, developing parallel versions of the learning
algorithms that allow implementations to be scaled up by adding processing hardware
with a negligible penalty in processing time, and performing numerical experiments on
simulated data to test the approach. We have also compared our approach with other
neural network approaches, and found that it provides equal or better performance. We
have implemented versions of this approach on several platforms: serial computers
(standard Sun workstations), digital signal processors (Intel i860), a parallel computer
(Connection Machine), and special purpose digital circuitry. We showed that we can
perform training and access sufficiently quickly to allow real-time learning, with a
real-time implementation of memory-based learning on a juggling task. We have also
explored alternative methods such as radial basis functions and projection pursuit, and
developed a more demanding experimental task: helicopter control.

The funds provided by the Air Force had a large impact on education as well as on
research, providing partial support for several graduate students and undergraduates.
Ying Zhao (Mathematics, a woman) received her PhD from the MIT Mathematics
Department for work done under this grant. She was also supervised by Prof. Peter
Huber in the Mathematics Department. Sherif Botros (Brain and Cognitive Sciences)
explored using radial basis functions for motor learning and optimization. Gertie Van
Zyl (Mechanical Engineering) built the juggling robot testbed for learning methods.
Gideon Stein (Electrical Engineering and Computer Science) explored digital
implementation technologies for our learning approach and programmed the Intel i860.
Chiqian Xie (Mechanical Engineering) worked on the learning control of flexible objects,
an important and difficult class of learning problems since the order (dimensionality of
the state vector) is not known. Stefano D’Aquino (Electrical Engineering and Computer
Science) explored analog implementation technologies. Several undergraduates received
prizes for their thesis work under this grant, including Peter Gordon (“An Associative
Content Addressable Memory”, S.B. in Electrical Engineering and Computer Science,
May 1990; awarded Second Prize in the David Adler Memorial Thesis Competition
by the Undergraduate Thesis Committee of the Department of Electrical Engineering
and Computer Science) and Paul Sajda (“Machine Implementation of a Human Motor
Task: The Yo-Yo Robot”, S.B. in Electrical Engineering and Computer Science,
May 1989; awarded Second Prize in the David Adler Memorial Thesis Competition
by the Undergraduate Thesis Committee of the Department of Electrical Engineering
and Computer Science). The students played a major role in the research, and also
learned a great deal. This research has also been incorporated in several courses,
including courses on motor learning, computational approaches to motor learning, and
nonparametric regression applied to learning and optimization.


Sincerely, 


Christopher  G.  Atkeson 
Associate  Professor 


cc:  Dr.  Harry  Klopf 


Final Report: Using Modular Neural Networks With Local
Representations To Control Dynamic Systems

Christopher  G.  Atkeson 

The objective of this research was to explore an artificial neural network
architecture that simply remembers experiences and builds local models to
answer particular queries. The reason we are interested in a memory-based
approach is that it offers the possibility of fast training and minimal
interference. The approach is to model complex systems using many simple local
models. This approach avoids the difficult problem of finding an appropriate
structure for a global model. To implement this approach a two-part artificial
neural network is designed. One part uses a local representation to remember
the training data set, and the second part is trained on selected portions of
the training set to form local models as needed. This network architecture
can be simulated using k-d tree data structures on standard serial computers
and also using parallel search on a massively parallel computer such as the
Connection Machine. The performance of the network was initially evaluated
by using it to control simulated dynamic systems. The ultimate goal is
to demonstrate successful control of actual dynamic systems such as a robot
helicopter.

One scientific contribution was to develop and demonstrate the utility of
nonparametric methods for adaptive control. Memory-based control provides
a new and possibly better method for solving control problems such
as the robot trajectory following problem. Another contribution is to the
study of learning in intelligent systems. Demonstrating that learning a local
representation is fast and efficient and that the problems of limited memory
capacity can be overcome is an important contribution to the continuing
debate between proponents of local and distributed representations. This is of
great relevance to neuroscientists who study biological systems, as well as to
computer designers. This research will also contribute to our understanding
of human learning. We currently do not understand how humans learn motor
tasks. Comparing the behavior of memory-based learning algorithms to
actual human learning may give us insight into how humans learn and how
they might be taught more effectively.

Memory-based learning provides one approach to fast training, in that
the representation is trained by storing experiences in a large memory. This
corresponds to winner-take-all networks which use a local representation,
and can be used to perform nearest neighbor search for relevant experiences.
Locally weighted regression is a form of memory-based learning in which a
model is fit to relevant experiences in order to make a prediction. Locally
weighted regression approximates complex functions using simple local models,
as does a Taylor series. During training, experiences are stored in a
memory. When a query must be answered, experiences relevant to the query
are found and combined to form a local model. Examples of types of local
models include nearest neighbor, weighted average, and locally weighted
regression. Each of these local models combines points near a query point
to estimate the appropriate output. Locally weighted regression uses a
relatively complex regression procedure to form this model, and is thus more
expensive than nearest neighbor and weighted average memory-based learning
procedures. For each query a new local model is formed. The rate at
which local models can be formed and evaluated limits the rate at which
queries can be answered. However, we have found that locally weighted
regression can be implemented in real time, and it has been implemented for
online robot learning of a challenging control task (a juggling task known
as “devil sticking”). We used commercially available microprocessors (Intel
i860 and Texas Instruments TMS320C30). We have also found that memory-based
learning avoids interference between new and old data by retaining and
using all the data to answer each query.
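
The two simplest local models named above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work: the function names, the Euclidean distance, the exponent `p`, and the small constant `eps` are all assumptions made for the example.

```python
import math

def distance(x, q):
    """Euclidean distance between a stored input x and a query q."""
    return math.sqrt(sum((xj - qj) ** 2 for xj, qj in zip(x, q)))

def nearest_neighbor(memory, q):
    """Return the output of the stored experience closest to the query."""
    _, y = min(memory, key=lambda xy: distance(xy[0], q))
    return y

def weighted_average(memory, q, p=2.0, eps=1e-12):
    """Average the stored outputs with weights that fall off with distance."""
    num = 0.0
    den = 0.0
    for x, y in memory:
        w = 1.0 / (distance(x, q) ** p + eps)  # eps avoids division by zero
        num += w * y
        den += w
    return num / den

# "Training" is just storing (input, output) pairs; here y = x0 + x1.
memory = [((0.0, 0.0), 0.0), ((1.0, 0.0), 1.0),
          ((0.0, 1.0), 1.0), ((1.0, 1.0), 2.0)]
```

Both predictors scan the whole memory for every query, which is why adding a new experience never disturbs what was stored earlier.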

To illustrate how the local model is formed in locally weighted regression,
we will first consider a global model formed using unweighted regression.
The inputs of each training data point form a row in the matrix $X$, and
the outputs form a corresponding row in the vector $y$. The structure of the
model is chosen so it is linear in the unknown parameters, which appear in
the vector $\beta$ and are related by the set of linear equations

$$X\beta = y$$

Since there are more equations (data points) than unknown parameters, the
parameters $\beta$ are chosen by minimizing the sum of the squared fitting errors:

$$\min_{\beta}\,(X\beta - y)^T (X\beta - y)$$

This sum is minimized by the solution of the normal equations:

$$(X^T X)\,\beta = X^T y$$
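
The step from the least-squares criterion to the normal equations can be written out explicitly. The following is the standard textbook derivation in the notation used here, with $E(\beta)$ introduced only as a name for the sum of squared fitting errors:

```latex
\begin{aligned}
E(\beta) &= (X\beta - y)^T (X\beta - y)
          = \beta^T X^T X\,\beta - 2\,\beta^T X^T y + y^T y, \\
\frac{\partial E}{\partial \beta} &= 2\,X^T X\,\beta - 2\,X^T y = 0
\quad\Longrightarrow\quad (X^T X)\,\beta = X^T y .
\end{aligned}
```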

We can think of this process as minimizing the energy of a set of identical
springs connecting the data points to the model surface. A problem with
unweighted regression is that points distant from the query point have as much
influence on the answer to the query as nearby points (for equally spaced
data).

Locally weighted regression reduces the influence of distant points on
the query answer by weighting the data according to its distance from the
query point. In order to do this one needs to know what the query point $q$ is
and to have a distance metric and a weighting function that transforms the
calculated distance into a weight:

$$d_i^2 = \sum_j m_j^2\,(x_{ij} - q_j)^2$$

$$w_i = d_i^{-p}$$

Each row of $X$ and $y$ is multiplied by the weight for that point, $w_i$. We
can think of locally weighted regression as minimizing the energy of a set of
springs whose spring constants decrease with distance from the query point.
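
As a rough sketch of the weighting step, the function below turns a scaled distance into a weight of the form $w_i = d_i^{-p}$. The exact way the metric coefficients $m_j$ enter the distance is an assumption (here each coordinate difference is scaled by $m_j$ before squaring), and the lower bound on the distance is a safeguard added for this example only.

```python
import math

def lwr_weights(X, q, m, p):
    """Weights w_i = d_i**(-p), with d_i^2 = sum_j (m_j * (x_ij - q_j))**2."""
    weights = []
    for x in X:
        d2 = sum((mj * (xj - qj)) ** 2 for mj, xj, qj in zip(m, x, q))
        d = math.sqrt(d2)
        # A stored point that coincides with the query would get an infinite
        # weight; bounding the distance from below is an assumption made here.
        weights.append(max(d, 1e-6) ** (-p))
    return weights

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
w = lwr_weights(X, q=(0.0, 0.0), m=(1.0, 0.5), p=2.0)
```

With `m = (1.0, 0.5)` the second input dimension counts half as much, so the points `(1.0, 0.0)` and `(0.0, 2.0)` are equally far from the query and receive the same weight.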

Often there is not enough data in all directions, which leads to an
ill-conditioned regression problem. The estimates are stabilized by adding small
positive numbers $\lambda_j$ to the diagonal of the $X^T X$ matrix. This technique is
known as ridge regression in statistics. It is equivalent to adding fake data
in each direction that has a small weight and a zero output value. The ridge
regression constants can also be thought of as Bayesian priors on the variance
of the estimated parameter vector $\beta$. With $\Lambda$ the diagonal matrix of
the constants $\lambda_j$:

$$(X^T X + \Lambda)\,\beta = X^T y$$
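
The equivalence between ridge regression and fake data can be checked directly. Assuming the constants added to the diagonal are $\lambda_1, \dots, \lambda_k$, append one fake row per parameter direction, each a unit vector scaled by $\sqrt{\lambda_j}$ with output zero (the tilde notation is introduced here only for this check):

```latex
\tilde{X} = \begin{pmatrix} X \\ \Lambda^{1/2} \end{pmatrix}, \qquad
\tilde{y} = \begin{pmatrix} y \\ 0 \end{pmatrix}, \qquad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_k)
\\[6pt]
\tilde{X}^T \tilde{X} = X^T X + \Lambda, \qquad
\tilde{X}^T \tilde{y} = X^T y
```

The normal equations of the augmented problem are therefore exactly the ridge regression equations.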

We use off-line cross validation to estimate reasonable values for the fit
parameters: the distance metric $m_j$, the exponent $p$ of the weighting
function $w_i = d_i^{-p}$, and the ridge regression parameters $\lambda_j$. Since
we are using a local model that is linear in the unknown parameters $\beta$, we
can compute derivatives of the cross validation error $e_i = \hat{y}_i - y_i$
with respect to the fit parameters:

$$\frac{\partial e_i}{\partial m_j}, \qquad \frac{\partial e_i}{\partial p}, \qquad \frac{\partial e_i}{\partial \lambda_j}$$

and minimize the sum of the squared cross validation errors using a
Levenberg-Marquardt procedure.
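
The idea of cross validating a fit parameter can be illustrated with a much simpler setup than the one described above. The sketch below is a deliberate simplification: it uses a distance-weighted average as the local model instead of locally weighted regression, tunes only the exponent `p`, and replaces the Levenberg-Marquardt procedure with a search over a few hypothetical candidate values. The data set and all names are made up for the example.

```python
import math

def predict(xs, ys, q, p, skip=None):
    """Distance-weighted average of stored outputs, optionally leaving one out."""
    num = 0.0
    den = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i == skip:
            continue  # leave this point out of its own prediction
        w = (abs(x - q) + 1e-9) ** (-p)
        num += w * y
        den += w
    return num / den

def loo_error(xs, ys, p):
    """Sum of squared leave-one-out errors e_i = yhat_i - y_i."""
    return sum((predict(xs, ys, xs[i], p, skip=i) - ys[i]) ** 2
               for i in range(len(xs)))

xs = [i / 10.0 for i in range(21)]
ys = [math.sin(3.0 * x) for x in xs]
candidates = [0.5, 1.0, 2.0, 4.0, 8.0]
errors = {p: loo_error(xs, ys, p) for p in candidates}
best_p = min(errors, key=errors.get)
```

Each stored point is predicted from all the others, so the error measures generalization rather than the ability to reproduce the stored data.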

Lookup has three stages: forming weights, forming the regression matrix,
and solving the normal equations. Let us examine how the cost of each of
these stages grows with the size of the data set and dimensionality of the
problem. We will assume a linear local model.

Forming  and  applying  the  weights  involves  scanning  the  entire  data  set, 
so  it  scales  linearly  with  the  number  of  data  points  in  the  database  (n).  For 
each  of  d  input  dimensions  there  are  a  constant  number  of  operations,  so  the 
number  of  operations  scales  linearly  with  the  number  of  input  dimensions. 

Note that we can eliminate points whose distance is above a threshold,
reducing the number of points considered in subsequent stages of the
computation.

Each element of $X^T X$ and $X^T y$ is the inner (dot) product of two columns
of $X$ or $y$. The architecture of digital signal processors is ideally suited for
this computation, which consists of repeated multiplies and accumulates.
The computation is linear in the number of rows $n$ and quadratic in the
number of columns ($d^2 + d \cdot o$), where $d$ is the number of input dimensions
and $o$ is the number of output dimensions.

Solving the normal equations is done using an $LDL^T$ decomposition, which
is cubic in the number of input dimensions, and independent of the number of
data points. Other more sophisticated and more expensive decompositions,
such as the singular value decomposition, do not need to be used since the
ridge regression procedure guarantees that the normal equations will be
well-conditioned, and the cost of the solve is small compared with forming $X^T X$.
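
The three lookup stages can be put together in a short sketch. It is written for a single output and a linear local model with an appended constant column, which is an assumption, as are the function names, the form of the distance metric, the lower bound on the squared distance, and the use of one ridge constant for every diagonal entry.

```python
def ldlt_solve(A, b):
    """Solve A x = b for a symmetric positive definite A via A = L D L^T."""
    k = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    D = [0.0] * k
    for j in range(k):
        D[j] = A[j][j] - sum(L[j][s] ** 2 * D[s] for s in range(j))
        for i in range(j + 1, k):
            acc = sum(L[i][s] * L[j][s] * D[s] for s in range(j))
            L[i][j] = (A[i][j] - acc) / D[j]
    z = [0.0] * k
    for i in range(k):                      # forward substitution: L z = b
        z[i] = b[i] - sum(L[i][s] * z[s] for s in range(i))
    x = [z[i] / D[i] for i in range(k)]     # diagonal scaling
    for i in reversed(range(k)):            # back substitution: L^T x = z / D
        x[i] -= sum(L[s][i] * x[s] for s in range(i + 1, k))
    return x

def lwr_lookup(X, y, q, m, p, ridge):
    """Answer one query with locally weighted linear regression."""
    # Stage 1: form the weights, w_i = d_i**(-p), by scanning the data set.
    w = []
    for x in X:
        d2 = sum((mj * (xj - qj)) ** 2 for mj, xj, qj in zip(m, x, q))
        w.append(max(d2, 1e-12) ** (-p / 2.0))
    # Stage 2: weight each row, then form X^T X (plus ridge terms) and X^T y.
    rows = [[wi * v for v in list(x) + [1.0]] for wi, x in zip(w, X)]
    targets = [wi * yi for wi, yi in zip(w, y)]
    k = len(rows[0])
    A = [[sum(r[a] * r[c] for r in rows) for c in range(k)] for a in range(k)]
    for a in range(k):
        A[a][a] += ridge
    b = [sum(r[a] * t for r, t in zip(rows, targets)) for a in range(k)]
    # Stage 3: solve the normal equations and evaluate the model at the query.
    beta = ldlt_solve(A, b)
    return sum(bj * v for bj, v in zip(beta, list(q) + [1.0]))

X = [(i * 0.25, j * 0.25) for i in range(5) for j in range(5)]
y = [2.0 * x0 - x1 + 3.0 for x0, x1 in X]
prediction = lwr_lookup(X, y, q=(0.35, 0.62), m=(1.0, 1.0), p=2.0, ridge=1e-9)
```

Stages 1 and 2 touch every stored point, while stage 3 works only on the small matrix of size k by k, which matches the cost breakdown given above.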

The most straightforward parallel implementation of locally weighted
regression would distribute the data points among several processors. Queries
can be broadcast to the processors, and each processor can weight its data
set and form its contribution to $X^T X$ and $X^T y$. These contributions can be
summed and the full normal equations solved on a single processor. The
communication costs are logarithmic in the number of processors and quadratic
in the number of columns ($d^2 + d \cdot o$), and independent of the total number
of points.
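
The reason this decomposition works is that $X^T X$ and $X^T y$ are sums over data points, so partial sums computed on separate subsets of the data add up to the full matrices. The sketch below checks this on a small made-up data set; the function names are hypothetical, the rows are assumed to be already weighted, and the contributions are summed sequentially rather than with the tree-structured reduction that would give logarithmic communication cost.

```python
def partial_sums(rows, targets):
    """One processor's contribution to X^T X and X^T y for its share of the data."""
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * t for r, t in zip(rows, targets)) for a in range(k)]
    return xtx, xty

def combine(parts):
    """Sum the per-processor contributions element by element."""
    k = len(parts[0][1])
    xtx = [[sum(p[0][a][b] for p in parts) for b in range(k)] for a in range(k)]
    xty = [sum(p[1][a] for p in parts) for a in range(k)]
    return xtx, xty

rows = [[float(i), float(i % 3), 1.0] for i in range(12)]
targets = [2.0 * r[0] - r[1] + 0.5 for r in rows]
# Three "processors", each holding four of the twelve data points.
chunks = [(rows[i:i + 4], targets[i:i + 4]) for i in range(0, 12, 4)]
combined = combine([partial_sums(r, t) for r, t in chunks])
full = partial_sums(rows, targets)
```

Only the small matrices are communicated, so the amount of data exchanged does not grow with the number of stored points.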

We have implemented the locally weighted regression procedure on a 33 MHz
Intel i860 microprocessor. The peak computation rate of this processor is
66 MFlops. We have achieved effective computation rates of 20 MFlops
on a learning problem with 10 input dimensions and 5 output dimensions,
using a linear local model. This leads to a lookup time of approximately 20
milliseconds on a database of 1000 points.
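
A back-of-the-envelope estimate shows that these figures are consistent with the cost analysis of forming $X^T X$ and $X^T y$. The calculation below is only a rough check under stated assumptions: one extra column for a constant term, two floating point operations per multiply-accumulate, no use of the symmetry of $X^T X$, and no allowance for the weighting and solve stages.

```python
n, d, o = 1000, 10, 5   # data points, input and output dimensions (from the text)
cols = d + 1            # assumption: one extra column for a constant term
macs = n * (cols * cols + cols * o)   # multiply-accumulates for X^T X and X^T y
flops = 2 * macs        # one multiply and one add per accumulate
seconds = flops / 20e6  # at the reported effective rate of 20 MFlops
# seconds is 0.0176, i.e. roughly 18 milliseconds
```

The result is close to the reported lookup time of approximately 20 milliseconds.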

A  question  that  often  arises  with  memory-based  models  is  the  effect  of 
memory  limitations.  We  have  not  yet  needed  to  address  this  issue  in  our 
experiments.  However,  we  plan  to  explore  how  memory  use  can  be  minimized 
based  on  several  approaches.  One  approach  is  to  only  store  “surprises”.  The 
system  would  try  to  predict  the  outputs  of  a  data  point  before  trying  to  store 
it.  If  the  prediction  is  good,  it  is  not  necessary  to  store  the  point.  Another 
approach  is  to  forget  data  points.  Points  can  be  forgotten  or  removed  from 
the  database  based  on  age,  proximity  to  queries,  or  other  criteria.  Because 
memory-based  learning  retains  the  original  training  data,  forgetting  can  be 
explicitly  controlled. 
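
The "store only surprises" idea can be sketched as follows. This is an illustration of the proposal, not an implementation that was evaluated in this work; the nearest neighbor predictor, the `tolerance` threshold, and the sample data are all assumptions chosen to keep the example short.

```python
def predict_nearest(memory, x):
    """Predict with the output of the closest stored point (None if empty)."""
    if not memory:
        return None
    _, y = min(memory, key=lambda xy: abs(xy[0] - x))
    return y

def store_if_surprising(memory, x, y, tolerance):
    """Store (x, y) only when the current prediction misses by more than tolerance."""
    guess = predict_nearest(memory, x)
    if guess is None or abs(guess - y) > tolerance:
        memory.append((x, y))
        return True
    return False

memory = []
stream = [(i / 100.0, (i / 100.0) ** 2) for i in range(101)]  # samples of y = x^2
for x, y in stream:
    store_if_surprising(memory, x, y, tolerance=0.05)
```

After the loop, `memory` holds only a subset of the stream, with stored points spaced more closely where the function changes faster.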

1  Performance  Comparisons 

Two methods, CMAC (Albus 1975a, b) and sigmoidal feedforward neural
networks, were compared to the approach explored in this paper. The
parameters for the CMAC approach were taken from Miller, Glanz, and Kraft (1987),
who used the CMAC to model arm dynamics. The architecture for the
sigmoidal feedforward neural network was taken from Goldberg and Pearlmutter
(1988, section 6), who also modeled arm dynamics.

The ability of each of these methods to predict the torques of the
simulated two joint arm at 1000 random points was compared. Figure 1 plots
the normalized RMS prediction error. The points were sampled uniformly
using ranges comparable to those used in (Miller et al. 1987). Initially, each
method was trained on a training set of 1000 random samples of the two joint
arm dynamics function, and then the predictions of the torques on a separate
test set of 1000 random samples of the two joint arm dynamics function were
assessed (points 1, 3, and 5). Each method was then trained on 10 attempts
to make a particular desired movement. Each method successfully learned
the desired movement. After this second round of training, performance on
the random test set was again measured (points 2, 4, and 6).

Figure 1: Performance of various methods on two joint arm dynamics.

The data indicate that the locally weighted regression approach (filled
circles) and the sigmoidal feedforward network approach (asterisks) both
generalize well on this problem (points 3 and 5 have low error). The CMAC
(diamonds) did not generalize well on this problem (point 1 has a large error),
although it represented the original training set with a normalized RMS error
of 0.000001. A variety of CMAC resolutions were explored, ranging from a
basic CMAC cell size covering the entire range of data to a cell size covering
a fifth of the data range in each dimension. A cell size covering one half the
data ranges in each dimension generalized best (the data shown here).
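
The normalization used for the RMS error is not spelled out in the text. One common convention, assumed here purely for illustration, divides the RMS prediction error by the RMS deviation of the true outputs from their mean, so that always predicting the mean scores 1 and a perfect predictor scores 0.

```python
import math

def normalized_rms_error(predicted, actual):
    """RMS prediction error divided by the RMS deviation of the targets from their mean."""
    n = len(actual)
    mean = sum(actual) / n
    err = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
    spread = math.sqrt(sum((a - mean) ** 2 for a in actual) / n)
    return err / spread

actual = [1.0, 2.0, 3.0, 4.0]
perfect = normalized_rms_error(actual, actual)        # exact predictions
mean_only = normalized_rms_error([2.5] * 4, actual)   # always predict the mean
```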

After training on a different training set (the attempts to make a
particular desired movement), the sigmoidal feedforward neural network lost its
memory of the full dynamics (point 4), and represented only the dynamics
of the particular movements being learned in the second training set. This
interference between new and previously learned data was not prevented by
increasing the number of hidden units in the single layer network from 10
up to 100. The other methods explored did not show this interference effect
(points 2 and 6).


2  Other  Methods  Explored 

In the pursuit of fast training methods we explored a variety of techniques for
fast or real time function approximation, including radial basis functions and
projection pursuit regression. A graduate student, Sherif Botros, looked at
ways to speed up radial basis function based learning, and make it more
effective (Botros and Atkeson, 1991). Radial basis functions are a form of neural
network model in which the hidden units are multidimensional “bumps”.
The function to be learned is approximated by summing the bumps:

$$f(x) = \sum_i c_i\, g(\lVert x - x_i \rVert)$$

For fixed bump locations and shapes, estimating $c_i$ is a linear regression
problem, making this approach attractive for linear adaptive control methods. In
our research we have found that the choice of distance metric is critical. We
have found heuristics for estimating good initial metrics, which can be refined
using nonlinear parameter estimation techniques. Using these techniques we
have found that radial basis functions are currently the most effective neural
network approach to modeling robot arm rigid body dynamics. A remaining
challenge is that the RBF approach generally leads to a large regression
problem, which is difficult to implement in real time.
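
The observation that estimating the coefficients $c_i$ is a linear problem once the bumps are fixed can be illustrated with a small one-dimensional example. The sketch below makes several assumptions that are not taken from the text: Gaussian bumps of a fixed width, one bump centered on each data point (so the regression reduces to a square linear system), and a plain Gaussian elimination solver.

```python
import math

def bump(r, width=0.5):
    """Gaussian bump g(r); the Gaussian shape and the width are assumptions."""
    return math.exp(-(r / width) ** 2)

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def fit_rbf(centers, xs, ys):
    """With one bump per data point, the coefficients solve a square linear system."""
    G = [[bump(abs(x - ci)) for ci in centers] for x in xs]
    return solve(G, ys)

def rbf(x, centers, coeffs):
    """f(x) = sum_i c_i g(|x - x_i|)."""
    return sum(c * bump(abs(x - ci)) for c, ci in zip(coeffs, centers))

xs = [i * 0.5 for i in range(7)]      # inputs 0.0, 0.5, ..., 3.0
ys = [math.sin(x) for x in xs]
coeffs = fit_rbf(xs, xs, ys)
```

The size of the linear system grows with the number of bumps, which is the source of the real-time difficulty noted above.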

Another graduate student, Ying Zhao, studied learning based
on projection pursuit regression (Zhao and Atkeson, 1991a, 1991b).
Projection pursuit regression is a form of neural network model in which the hidden
units are general one dimensional functions rather than sigmoids:

$$f(x) = \sum_i g_i(\theta_i^T x)$$

One can view one hidden layer sigmoidal neural networks as a specialization
of projection pursuit networks:

$$f(x) = \sum_i c_i\, \sigma(\theta_i^T x + b_i)$$
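
The relationship between the two model forms can be made concrete with a short sketch. Every name here is hypothetical, and the logistic function with a scale `c` and offset `b` is an assumed parameterization of the sigmoidal unit; the point is only that the same evaluation routine covers both cases, with the sigmoidal network restricting each hidden unit function to one family.

```python
import math

def dot(a, b):
    """Inner product of two vectors given as sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def projection_pursuit(x, directions, unit_functions):
    """f(x) = sum_i g_i(theta_i . x), with an arbitrary 1-D function per hidden unit."""
    return sum(g(dot(theta, x)) for theta, g in zip(directions, unit_functions))

def sigmoid_unit(c, b):
    """A hidden unit function restricted to a scaled, shifted logistic sigmoid."""
    return lambda s: c / (1.0 + math.exp(-(s + b)))

directions = [(1.0, 0.0), (0.6, 0.8)]
# General case: each hidden unit has its own one dimensional function.
general = projection_pursuit((0.5, 2.0), directions,
                             [math.sin, lambda s: s * s])
# Specialization: every hidden unit function is a sigmoid.
sigmoidal = projection_pursuit((0.0, 0.0), directions,
                               [sigmoid_unit(2.0, 0.0), sigmoid_unit(-1.0, 0.0)])
```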

We are exploring heuristics to choose good initial directions ($\theta_i$) for hidden
units. We have also found that projection pursuit learning networks work
better on angular smooth functions than on Laplacian smooth functions.
Here “work better” means that for fixed complexities of hidden unit functions
and a certain approximation accuracy requirement fewer hidden units are
required; or given a fixed number of hidden units a better accuracy can be
achieved. As yet we have no real time implementation of projection pursuit
learning networks.




